Spam Blog Filtering with Bipartite Graph Clustering and Mutual Detection between Spam Blogs and Words

Author

  • Kazunari Ishida
Abstract

This paper proposes a mutual detection mechanism between spam blogs and words with bipartite graph clustering for filtering spam blogs from updated blog data. Spam blogs are problematic in extracting useful marketing information from the blogosphere; they often appear to be rich sources of information based on individual opinion and social reputation. One characteristic of spam blogs is copied-and-pasted articles based on normal blogs and news articles. Another is multiple postings of the same article to increase the chances of exposure and income from advertising. Because of these characteristics, spam blogs share common words, and such blogs and words can form large spam bi-clusters. This paper explains how to detect spam blogs and spam words with mutual filtering based on such clusters. It reports that the maximum precision, or F-measure, of the filtering is 95%, based on a preliminary experiment with approximately six months' updated blog data and a more detailed experiment with one day's data. An advantage of this method for spam blog filtering, as compared to a machine learning approach, is also supported by experiments with SVM.
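The mutual detection idea in the abstract can be illustrated with a toy example. The sketch below is not the paper's algorithm: the function name `mutual_detection`, the seed set, and both thresholds are hypothetical, and the bi-clustering step itself is not reproduced. It only shows how flags might propagate back and forth across a blog-word bipartite graph, assuming a few spam blogs are already known.

```python
from collections import defaultdict


def mutual_detection(blog_words, seed_spam_blogs,
                     min_spam_blogs=2, min_spam_words=2):
    """Alternately flag words and blogs until no new flags appear.

    blog_words: dict mapping a blog id to the set of words it contains.
    seed_spam_blogs: blogs assumed to be known spam (hypothetical input).
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    # Build the word -> blogs side of the bipartite graph.
    word_blogs = defaultdict(set)
    for blog, words in blog_words.items():
        for word in words:
            word_blogs[word].add(blog)

    spam_blogs = set(seed_spam_blogs)
    spam_words = set()
    changed = True
    while changed:
        changed = False
        # A word shared by enough flagged blogs becomes a spam word.
        for word, blogs in word_blogs.items():
            if word not in spam_words and len(blogs & spam_blogs) >= min_spam_blogs:
                spam_words.add(word)
                changed = True
        # A blog containing enough flagged words becomes a spam blog.
        for blog, words in blog_words.items():
            if blog not in spam_blogs and len(set(words) & spam_words) >= min_spam_words:
                spam_blogs.add(blog)
                changed = True
    return spam_blogs, spam_words


# Made-up data: s1..s3 share copied text, n1 and n2 are ordinary blogs.
blog_words = {
    "s1": {"cheap", "pills", "buy"},
    "s2": {"cheap", "pills", "now"},
    "s3": {"cheap", "pills", "deal"},
    "n1": {"garden", "tomato", "cheap"},
    "n2": {"garden", "recipe"},
}
spam_blogs, spam_words = mutual_detection(blog_words, {"s1", "s2"})
print(sorted(spam_blogs), sorted(spam_words))
```

In this toy data, `s3` is flagged because it shares two flagged words with the seeds, while `n1` shares only one and is left alone. Real thresholds would need tuning against labeled data.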

Similar resources

Bayesian Based Comment Spam Defending Tool

Spam clutters users' inboxes, consumes network resources, and spreads worms and viruses. Spam is the flooding of unsolicited, unwanted e-mail. Spam in blogs is called blog spam or comment spam. It is done by posting comments or flooding spam to services such as blogs, forums, news, e-mail archives, and guestbooks. Blog spam generally appears on guestbooks or comment pages where spammers fill a commen...

Library blogs and user participation: a survey about comment spam in library blogs

Purpose: The purpose of this research is to identify and describe the impact of comment spam in library blogs. Three research questions guided the study: current level of commenting in library blogs; librarians' perception of comment spam; and techniques used to address the comment spam problem. Design/methodology/approach: A quantitative approach is used to investigate research questions. Inform...

Spam Source Clustering by Constructing Spammer Network with Correlation Measure

Spam filtering is one of the most challenging problems in electronic message systems. In general, recent studies on identifying the real spam source are based on content filtering because spammers usually falsify their origin. We propose a method to identify spam sources based on structural analysis with complex networks. We assume that each spam source either has the same victim list or uses the same s...

Spam Filtering Based on Supervised Latent Semantic Features Extraction

Spam text is a universal phenomenon on the “open web”, including large-scale email systems and the growing number of blogs. Handling this information overload is becoming an increasingly challenging problem. A promising approach is the use of content-based filtering. In this paper, our focus is placed on finding an effective dimension reduction method for email spam filtering; we apply a superv...

Blog Track Open Task: Spam Blog Classification

Spam blogs or Splogs are blogs with either auto-generated or plagiarized content created for the sole purpose of hosting ads, promoting affiliate sites and getting new pages indexed. Splogs now rival generic web spam and e-mail spam, presenting a major problem to analytics on the blogosphere from basic search and indexing, to opinion, community, influence and correlation detection. This open ta...

Journal title:
  • JDIM

Volume 8  Issue 

Pages  -

Publication date 2010